Whole genome DNA and RNA sequencing of whole blood elucidates the genetic architecture of gene expression underlying a wide range of diseases

To create a scientific resource of expression quantitative trail loci (eQTL), we conducted a genome-wide association study (GWAS) using genotypes obtained from whole genome sequencing (WGS) of DNA and gene expression levels from RNA sequencing (RNA-seq) of whole blood in 2622 participants in Framingham Heart Study. We identified 6,778,286 cis-eQTL variant-gene transcript (eGene) pairs at p < 5 × 10–8 (2,855,111 unique cis-eQTL variants and 15,982 unique eGenes) and 1,469,754 trans-eQTL variant-eGene pairs at p < 1e−12 (526,056 unique trans-eQTL variants and 7233 unique eGenes). In addition, 442,379 cis-eQTL variants were associated with expression of 1518 long non-protein coding RNAs (lncRNAs). Gene Ontology (GO) analyses revealed that the top GO terms for cis-eGenes are enriched for immune functions (FDR < 0.05). The cis-eQTL variants are enriched for SNPs reported to be associated with 815 traits in prior GWAS, including cardiovascular disease risk factors. As proof of concept, we used this eQTL resource in conjunction with genetic variants from public GWAS databases in causal inference testing (e.g., COVID-19 severity). After Bonferroni correction, Mendelian randomization analyses identified putative causal associations of 60 eGenes with systolic blood pressure, 13 genes with coronary artery disease, and seven genes with COVID-19 severity. This study created a comprehensive eQTL resource via BioData Catalyst that will be made available to the scientific community. This will advance understanding of the genetic architecture of gene expression underlying a wide range of diseases.

www.nature.com/scientificreports/ Expression quantitative trait locus (eQTL) analysis seeks to identify genetic variants that affect the expression of local (cis) or distant (trans) genes (eGenes). Until recently, eQTL analysis has relied on high throughput microarray technologies and spawned a wave of genome-wide eQTL studies [6][7][8][9][10][11] including a recent study from our group 12 . These studies aided the understanding of the functional relevance of many GWAS results. Importantly, a hypothesis-free genome-wide eQTL approach permits the identification of new putatively functional loci without requiring previous knowledge of specific regulatory regions.
Most previous eQTL analyses were limited by small sample sizes and by the imprecision of microarrays. Newer technologies of RNA sequencing (RNA-seq) and whole genome sequencing (WGS) of DNA add greater precision and relevance to eQTL analyses. In conjunction with the National Heart, Lung, and Blood Institute's (NHLBI) Trans-Omics for Precision Medicine (TOPMed) Program 13 , the Framingham Heart Study (FHS) has obtained whole genome sequencing (WGS) in ~ 6100 study participants to help understand the molecular basis of heart, lung, blood, and sleep disorders and to advance precision medicine. Among FHS participants with WGS, RNA-seq was obtained in 2622 participants. We conducted genome-wide eQTL analyses using high-precision genotypes obtained via WGS and gene expression levels from RNA-seq of whole blood. The primary objectives of this study were three-fold. Firstly, it sought to provide a scientific resource of cis and trans gene-level eQTL data to facilitate understanding of the genetic architecture of gene expression traits. Secondly, it was aimed to provide eQTL data for long noncoding RNAs (lncRNAs) that were not captured in prior array-based eQTL studies. Thirdly, it attempted to demonstrate the utility of the eQTL resource in causal inference analyses.

Results
Of the 2622 FHS participants in eQTL analyses, 720 participants were from the FHS Offspring cohort (mean age 71 ± 8 years; 59% women) and 1902 were from the Third Generation cohort (mean age 47 ± 8 years; 52% women) (Supplemental Table 1). We used 19,624,299 SNPs with a minor allele count (MAC) ≥ 10 and 58,870 expression levels in association analyses to identify gene-level eQTLs. We evaluated the genomic inflation factor (λ GC ). The observed λ GC = 1.03, indicating that inflation was unlikely for the eQTL analyses (Supplemental Fig. 1).
Gene-level eQTL results. cis-eQTLs. Cis-eQTLs was defined as SNPs within 1 Mb of the transcription start sites (TSSs) of targeting genes. We identified 6,778,286 significant cis-eQTL variant-eGene pairs from 2,855,111 unique cis-eQTL variants and 15,982 unique eGenes (at p < 5 × 10 -8 ) (Supplemental Table 2). The median number of cis-eQTL variants per gene was 183 (interquartile range = 47,463). The eGenes harboring the largest numbers of cis-eQTL variants are located in the human leukocyte antigen (HLA) or major histocompatibility complex (MHC) on chromosome 6, reflecting a large number of SNPs in strong linkage disequilibrium (LD) at the MHC locus 14 . Owing to the computational burden, we selected the strongest cis-eQTL variant (i.e., the lead variant) as that which had the lowest p-value per eGene. If several cis-eQTLs displayed the same p-value (i.e., they are in perfect LD, r 2 = 1), we randomly select one lead eQTL variant per eGene (Supplemental Table 3) and the top 25 pairs was displayed in Table 1. Of the 15,982 significant unique cis-eQTL variant-eGene pairs, 82.8% (n = 13,236) of SNPs were within 100 kb of the transcription start sites (TSSs) of the respective eGenes, 9.3% (n = 1486) within 101 kb-200 kb region, 5.7% (n = 909) within 201 kb-500 kb region, and 2.2% (n = 351) within 501 kb-1 Mb (Fig. 1A). Published GWAS and QTL analyses revealed that rare variants have larger effect sizes than common variants 6,15 . Therefore, we compared the median, 25th percentile, and 75th percentile of the absolute values of effect sizes for lead cis-eQTL variants across four variant groups based on minor allele frequencies (MAFs). We found that rare cis-eQTL variants displayed larger effect sizes (median effect size 0.44 versus 1.77 for cis-QTL variants with MAF in 0.1-0.5 versus cis-QTL variants with MAF in 0.003-0.01) (trend test P < 0.001) ( Table 2).
trans-eQTLs. Trans-eQTLs referred to the SNPs that were beyond of 1 Mb of the TSSs of the eGenes on the same chromosome or those on the different chromosomes of the eGenes. We identified 1,469,754 significant trans-eQTL variant-eGene pairs (p < 1e−12) from 526,056 unique trans-eQTL variants and 7233 trans-eGenes (Table 3, Supplemental Table 4). The median number of significant-eQTL variants per eGene was 11 (interquartile range = 2, 76) 14 . With the same method used to select the lead cis-eQTL variants, we selected the lead trans-eQTL variant based on p-values for each trans-eGene, yielding 7233 unique trans-eQTL-eGene pairs (Supplemental Table 4). We further compared the effect sizes of the lead trans-eQTL variants based on their MAF. We found that rare trans-eQTL variants (MAF in 0.003-0.01) displayed larger effect sizes (median effect size 0.42 versus 2.38 for common trans-QTL variants (MAF in 0.1-0.5) (trend test P < 0.001) ( Table 2).

Replication analyses. The Battle study only provided p values for eQTL analyses. Of the reported 10,914
cis-eQTL-eGene pairs from the study by Battle et al. 23 (FDR < 0.05) 23 , 6782 (62%) pairs displayed p < 5e−8 in the present study. The average proportion of variance explained by these 6782 cis-eQTL variants in respective genes was 0.11 (Supplemental Table 12). Of the 269 trans-eQTL-eGene pairs (FDR < 0.05) reported by Battle et al. 23 47 (18%) pairs displayed p < 1e−12 in the current study. The average proportion of variance explained by these 47 trans-eQTL variants in respective genes was 0.076. Of note, all 47 trans-eQTL variants and respective trans-eGenes are located on the same chromosomes (Supplemental Table 13). The average distance between these trans-eQTL variants and respective trans-eGenes is within 22 Mb. We conducted additional replication analysis for the cis-eQTL variant-eGene pairs generated from 8,372,247 SNPs and 20,188 gene transcripts that were common to our study (n = 2622 participants) and to GTEx(6) (n = 755 participants). At p < 5e−8, we identified 1,080,485 cis-eQTL variant-eGene pairs in GTEx and 3,852,182 pairs in our study; of these, 951,085 pairs (88% of pairs in GTEx) displayed the same effect direction as in our larger study. (Supplemental Fig. 3). At p < 1e−4, we identified 1,815,208 cis-eQTL variant-eGene pairs in GTEx and 6,364,173 pairs in this study; of these, 1,797,977 (99% of pairs in GTEx) displayed the same effect directionality with our study. As can be seen in the figure, there is considerable concordance between the Framingham Heart Study (FHS) and GTEx eQTL effect sizes, although the FHS has a larger sample size than GTEx whole blood samples (2622 vs 755), which results in a smaller standard error and larger t-statistics (Supplemental Fig. 4).

Discussion
We leveraged WGS and RNA-seq data from 2622 FHS participants to create a powerful scientific resource of eQTLs. We identified significant unique cis-eQTL variants-eGene pairs (n = 2,855,111 unique variants with cis-15,982 eGenes) and 526,056 unique trans-eQTL variants-eGene pairs (526,056 unique variants and unique 7233 trans-eGenes. A large proportion of reported cis-eQTL variant-eGene pairs were replicated with directionally concordant in our study including 88% of cis-variant-eGene pairs from GTEx. Consistent with our previous study and others [7][8][9][10][11][12]24,25 , 90% of eQTL variants identified in the present study are located within 1 Mb of the corresponding cis-eGene and 83% are within 100 kb of the TSSs of the corresponding eGene. While the majority of (85% of cis-and 96% of trans-) lead eQTL variants explained only a small proportion (R 2 < 0.2) of interindividual variation in expression of the corresponding eGenes, 15% of lead cis-eQTL variants and 4% of lead trans variant explained 20% or more of interindividual variation in expression of the corresponding eGenes 26 . Additionally, eQTL variants were enriched (p < 0.0001) in disease-associated SNPs identified by GWAS. We further demonstrated the utility of our eQTL resource for conducting causal inference testing. Our MR analyses revealed putatively causal relations of gene expression to several disease phenotypes including SBP, CAD, and COVID-19 severity. Taken together, the comprehensive eQTL resource we provide can advance understanding of the genetic architecture of gene expression underlying a wide variety of diseases. The interactive and browsable eQTL resource will be posted to the National Heart, Lung, and Blood Institute's BioData Catalyst site and will be freely accessible to the scientific community.
Our study expands current knowledge by creating an accessible and browsable resource of eQTLs based on WGS and RNA-seq technologies. It also includes eQTLs for lncRNAs that were not reported in prior eQTL studies that used array-based expression profiling. Over the past decade, accumulating evidence shows that lncRNAs are widely expressed and have key roles in gene regulation 27,28 . It is estimated that the human genome contains 16,000 to 100,000 lncRNAs 27 . We identified 447,598 cis-eQTL variants for 1518 cis-lncRNAs and 121,241 trans-eQTLs for 475 trans-lncRNAs (Supplemental Tables 5 and 6). In addition, we identified six lncRNAs that showed putative causal associations with SBP. However, the functions of these six lncRNAs remain to be determined. Thus, our novel eQTL database may also help in the study of non-protein-coding RNAs in relation to health and disease.
As a proof of concept of the application of the eQTL resource, we performed MR analyses on a small number of cardiovascular traits and COVID-19 severity and demonstrated that the eQTL database can identify promising candidate genes with evidence of putatively causal relations to disease that may merit functional studies. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread across the globe and caused millions of deaths since it emerged in 2019. Recent GWAS of COVID-19 susceptibility and severity [29][30][31] have identified SNPs in several loci on chromosomes 3, 9 and 21 32 . Using our eQTL resource in conjunction with COVID-19 GWAS, we conducted MR analyses that identified seven genes, including OAS1 and IFNAR2, as putatively causal for COVID-19 severity. The OAS1/2/3 cluster has been identified as a risk locus for COVID-19 severity 29 . This area harbors a protective haplotype of approximately 75 kilo-bases (kb) at 12q24.13 among individuals of European ancestry 21 . A recent study identified an alternative splicing variant, rs10774671, at exon 7 of OAS1 for which the protective allele "G" leads to a more active OAS1 enzyme 22 . Our MR results suggest that both the OAS1 gene expression level and its splice variation are causal for COVID-19 severity.
The IFNAR2 gene encodes a protein in the type II cytokine receptor family. Mutations in IFNAR2 are associated with Immunodeficiency and measles virus susceptibility and play an essential and a narrow role in human antiviral immunity 33 . A recent study further showed that loss-of-function mutations in IFNAR2 are associated with severe COVID-19 34 . These studies, considered alongside our MR results provide evidence of a causal role of IFNAR2 expression in severe COVID-19 infection.
This study identified cis-eQTL in whole blood and used them as IVs in MR analyses. Therefore, caution is needed in interpreting the causal relations of several genes to several disease traits. Blood tissue is more easily accessible than other tissues, e.g., kidney or heart, for large association studies. Several previous studies of omics www.nature.com/scientificreports/ data have shown that findings in whole blood were compariable to other tissues 35,36 . Several eQTL studies have also demonstrated cis-and trans-eQTLs gene regulation across tissues. The GTEx Consortium found that pervasive cis-eQTLs affect the majority of human genes across tissues 6,37 . Another study also showed that the identified cardiometabolic genetic loci share downstream cis-and trans-gene regulation across tissues and diseases 38 . This study has several noteworthy limitations. This study included White participants of European ancestry who were middle-aged and older; therefore, the eQTLs identified may not be generalizable to other races or age ranges. The current RNA-seq platform included ~ 7700 lncRNAs, which is a modest subset of all lncRNAs in the human genome 27 . We used MR analyses to infer causal relation of genes to disease traits. MR analysis is predicated on a set of critical assumptions that may not be testable in the setting of eQTL analysis 39,40 . Replication of our eQTL findings is warranted in studies with larger sample sizes and more diverse populations. In addition, our study found that the cis-eQTL variants were enriched for GWAS-associated SNPs. It is possible that we may have underestimated signal regions when local LD in the GWAS region was not considered. An exhaustive investigation of colocalization 41,42 is not feasible in this study due to computational burden and storage limitations.
Our study also has several strengths. The advent of high-throughput RNA sequencing technology provides an unparalleled opportunity to accelerate understanding of the genetic architecture of gene expression. Our study extends and expands the existing literature by identifying novel eQTLs based on WGS and RNA-seq. We demonstrate the potential applications of a vast eQTL resource by analyzing the concordance of eQTL variants with SNPs from GWAS of several disease phenotypes followed by causal inference analyses that identified promising disease-related genes that may merit functional studies. We created an open and freely accessible eQTL repository that can serve as a promising scientific resource to better understand of the genetic architecture of gene expression and its relations to a wide variety of diseases.

Methods
Study participants. This study included participants from the FHS Offspring 10 and Third Generation cohorts 11 . Blood samples for RNA seq were collected from Offspring participants who attended the ninth examination cycle (2011-2014) and the Third Generation participants who attended the second examination cycle (2008)(2009)(2010)(2011). Protocols for participant examinations and collection of genetic materials were approved by the Institutional Review Board at Boston Medical Center. All participants provided written, informed consent for genetic studies. All research was performed in accordance with relevant guidelines/regulations. Isolation of RNA from whole blood and RNA-seq. Peripheral whole blood samples (2.5 mL) were collected from FHS participants (Offspring participants at the ninth examination cycle and the Third Generation participants at the second examination cycle) using PAXgene™ tubes (PreAnalytiX, Hombrechtikon, Switzerland), incubated at room temperature for 4 h for RNA stabilization, and then stored at − 80 °C until use. Total RNA was isolated using a standard protocol using a PAXgene Blood RNA Kit at the FHS Genetics Laboratory (FHS Third Generation cohort) and the TOPMed contract laboratory at Northwest Genomics Center (Offspring cohort). Tubes were allowed to thaw for 16 h at room temperature. White blood cell pellets were collected after centrifugation and washing. Cell pellets were lysed in guanidinium-containing buffer. The extracted RNA was tested for its quality by determining absorbance readings at 260 and 280 nm using a NanoDrop ND-1000 UV spectrophotometer. The Agilent Bioanalyzer 2100 microfluidic electrophoresis (Nano Assay and the Caliper LabChip system) was used to determine the integrity of total RNA.
All RNA samples were sequenced by an NHLBI TOPMed program 13 reference laboratory (Northwest Genomics Center) following the TOPMed RNA-seq protocol. All RNS-seq data were processed by University of Washington. The raw reads (in FASTQ files) were aligned using the GRCh38 reference build to generate BAM files. RNA-SeQC 43 was used for processing of RNA-seq data by the TOPMed RNA-seq pipeline to derive standard quality control metrics from aligned reads. Gene-level expression quantification was provided as read counts and transcripts per million (TPM). GENCODE 30 annotation was used for annotating gene-level expression.
Whole blood cell counts. Whole blood cell counts include white blood cell (WBC) count, red blood cell count, platelet count, and WBC differential percentages (neutrophil percent, lymphocyte percent, monocyte percent, eosinophil percent, and basophil percent). Contemporaneously measured blood cell counts were available in 2094 (80%) of the 2622 FHS participants used in eQTL analyses. We performed partial least squares (PLS) prediction method 44 with three-fold cross-validation (2/3 samples for training and 1/3 for validation) to impute these blood cell components using gene expression from RNA-seq. Prediction accuracy (R-squared) varied across blood component: WBC: 0.58, platelet: 27%, neutrophil percentage: 82%, lymphocyte percentage: 85%, monocyte percentage: 77%, eosinophil percentage: 87%, basophil percentage: 32%. Because 80% of the participants in this study had directly measured cell count variables and only 20% received imputed variables, we used the measured (in 2094 participants) and predicted (in 528 participants) blood cell components as covariates in regression models for eQTL analyses.
RNA-seq quality control, and data adjustment. To minimize confounding, expression residuals were generated by regressing transcript expression level on age, sex, measured or predicted blood cell count and differential cell proportions, and genetic principal components. Principal component (PC) analysis is a technique for reducing the dimensionality in large data sets 45 . It has been widely used in regression analyses to minimize unknown confounding. We included five PCs computed from FHS genotype profiles to account for population stratification. We also included 15 PCs computed from the transcriptome profile to account for unknown confounders that may affect gene expression. In addition, we adjusted for a relatedness matrix, and technical Association analyses of expression levels with SNPs. We performed association analyses of expression levels with genome-wide SNPs with minor allele count (MAC) ≥ 10. In a simple regression model, a SNP was used as an independent variable and the residuals of a transcript expression level was used as the dependent variable. All analyses were performed on the NIH-supported STRIDES cloud infrastructure. A graphical Processing Unit (GPU)-based program 12 was used to facilitate computation. Effect sizes, standard error, partial R-squared, and p-values for all SNP-gene expression pairs with p < 1e−4 were stored to enable lookups and to facilitate later meta-analysis. We evaluated the genomic inflation factor for eQTL analyses. Due to storage burden, we evaluated the genomic inflation factor based on full eQTL analysis (i.e., no p value restriction) on chromosome 12 because the length of this chromosome is close to the median length of chromosome 1-22. In this study, we defined cis-eQTLs as targeting genes within 1 Mb of their transcription start site (TSS). Trans-eQTLs referred to those that were beyond of 1 Mb of the TSSs of the eGenes on the same chromosome or those on the different chromosomes of the eGenes. A significant cis-eQTL of an eGene was identified if a SNP within 1 Mb of that gene was associated with expression of a transcript of that gene at P < 5 × 10 -8 . A significant trans-eQTL was defined as a SNP beyond 1 Mb that gave rise to P < 1 × 10 -12 in association a gene.

Estimation of variance in expression level explained by eQTLs.
An accurate estimation of heritability may help understand the degree to which genetic factors influence a trait 49 . Narrow-sense heritability measures the proportion of phenotypic variance explained by additive genetic effects 17 . We estimated the proportion of variance (R 2 ) in the expression level of a gene that was explained by the lead cis-eQTL (cis-R 2 ) or trans-eQTL (trans-R 2 ) variant. We conducted Mood's median test (median_test in the "coin" R package) to compare the median value of variance in cis-eGenes and trans-eGenes explained by cis-eQTLs versus trans-eQTLs. For cis-or trans-eGenes, we compared the median value of genetic variance in protein-coding eGenes versus lncRNA eGenes.

Comparison of effect sizes of eQTLs with different minor allele frequencies. Previous studies
showed that rare variants showed a large effect size in QTL analysis. For significant cis-eQTLs (P < 5 × 10 -8 ) and trans-eQTLs (P < 1 × 10 -12 ), we compared the median (25% quartile, 75% quartile) of the absolute values of effect sizes for eQTL variants in four intervals based on their minor allele frequencies Gene Ontology analyses. We selected the single, most significant eQTL variant (i.e. lead variant) for each eGene (for the gene level analysis) from cis-and trans-eQTL results separately. The eGenes annotated to the selected lead cis and trans eQTL variants were matched into Entrez IDs. We used the "goana" function from the "limma" package 46 to test for over-representation of gene ontology (GO) terms or KEGG pathways applied to the top 1000 eGenes. We used FDR < 0.05 to report GO terms including Biological Process, Cellular Component, and Molecular Function. Enrichment analyses using GWAS Catalog. We linked the eQTL variants with SNPs from the GWAS Catalog 2 (data downloaded on October 22, 2021), which included 243,618 entries for 2960 mapped traits at p < 5e−8. Cis-and trans-eQTL variants were analyzed separately. Unique SNP RS IDs were used for enrichment analysis with Fisher's test. FDR < 0.05 was used for significance.
Correlation analysis of selected lncRNA and protein coding genes. For lncRNAs that were in the top 25 cis-eQTL variant-eGene pairs, we performed partial Pearson correlation analyses between the expression level of the lncRNA and its nearby protein coding gene, adjusting for the same set of covariates that were included in eQTL analysis. We performed random sampling of 1000 genes 500 times to derive null distributions of partial Pearson correlation of these gene pairs. We calculated an empirical p-value to evaluate whether the partial Pearson correlation coefficient between the expression level of an lncRNA and its nearby protein coding gene was significantly higher than the average partial Pearson correlation coefficient from randomly selected gene pairs. The empirical p-value was calculated as the proportion of partial Pearson correlation coefficients that were more extreme than the correlation coefficient of an lncRNA and its nearby protein coding gene.
Mendelian randomization analysis. We conducted Mendelian randomization (MR) to demonstrate the application of the eQTL resource in causal inference analysis. We tested for potential causal association of the cis-eGenes with SBP, coronary artery disease (CAD), and COVID-19 severity. SBP-associated SNPs were obtained